Frequency of Movie Ratings and the Stock Market¶
Motivation¶
While exploring user activity in the MovieLens dataset, I noticed something interesting:
the 4-week rolling volume of movie ratings appeared to have a strong negative correlation with the S&P 500 index.
This raised a natural question:
Are people more likely to rate movies when the stock market is down?
I thought this was worth exploring, as it would be interesting to find quantitative evidence that economic anxiety or reduced social activity might drive increased media consumption.
The analysis below begins with this observation: we merge timestamped MovieLens rating data with historical S&P 500 values, compute rolling rating volumes, and examine the relationship. While the initial correlation appears compelling, we’ll eventually uncover a more mundane — and more important — explanation.
Setup: imports¶
import pandas as pd
import numpy as np
from datetime import datetime
import pandas_datareader.data as web
from scipy.stats import pearsonr
import plotly.express as px
from sklearn.linear_model import LinearRegression
from config.settings import DB_PATH, PROCESSED_DATA_PATH
Load MovieLens Rating Volume and S&P 500 Index¶
We load two preprocessed datasets:
- Daily forward-looking 4-week rating volumes from the MovieLens dataset
- Daily S&P 500 index values from the Federal Reserve Economic Data (FRED) API
The daily forward-looking 4-week rating volume is the total number of ratings made in the 28-day window that starts on each day (that day plus the 27 days after it). For example, the value associated with January 1, 2020, reflects the total number of movie ratings submitted between January 1 and January 28 (inclusive). This gives a sense of how active users were in the four weeks beginning on a given day.
The rating volume data is loaded from the file daily_forward_4w_rating_volume.parquet. The preprocessing steps used to generate this file are available in the src/data_processing/ directory of this project.
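As a rough illustration of the definition above, the following sketch computes a forward-looking 28-day total from a series of daily counts. It is not the project's preprocessing code (that lives in src/data_processing/); the function name `forward_window_volume` and the toy counts are assumptions made up for this example.

```python
import pandas as pd

def forward_window_volume(daily_counts: pd.Series, window_days: int = 28) -> pd.Series:
    """Sum each day's count with the counts of the following window_days - 1 days.

    pandas rolling windows look backward, so the series is reversed,
    rolled, and reversed again to get a forward-looking window.
    Days without a full window ahead of them come out as NaN.
    """
    reversed_counts = daily_counts.iloc[::-1]
    forward = reversed_counts.rolling(window=window_days).sum()
    return forward.iloc[::-1]

# Hypothetical toy data: 40 consecutive days with 0, 1, 2, ... ratings per day.
dates = pd.date_range("2020-01-01", periods=40, freq="D")
daily_counts = pd.Series(range(40), index=dates)

forward_4w = forward_window_volume(daily_counts)
print(forward_4w.head())
print(forward_4w.tail())
```

With this construction the last 27 days of the series have no complete window and are left as missing values, which is one reason a `dropna` step is needed before computing correlations.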
# Load 4-week forward rating volumes
ratings = pd.read_parquet(PROCESSED_DATA_PATH / "daily_forward_4w_rating_volume.parquet")
# Load S&P 500 index
start_date = '2016-01-01'
end_date = '2025-01-01'
sp500_df = web.DataReader('SP500', 'fred', start_date, end_date)
sp500_df = sp500_df.reset_index() # Reset index so 'DATE' becomes a column
sp500_df.rename(columns={"DATE": "timestamp", "SP500": "sp500_index"}, inplace=True)
The Initial Correlation¶
We now merge the two datasets on date and compute the Pearson correlation between them.
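One detail of this merge is worth spelling out: rating volumes can exist for every calendar day, while index values exist only for trading days, so a default (inner) merge keeps only dates present in both tables. The sketch below shows this on toy data; the frames `ratings_toy` and `sp500_toy` and all values in them are invented for illustration.

```python
import pandas as pd

# Hypothetical rating volumes for five consecutive days (Friday to Tuesday).
ratings_toy = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-03", periods=5, freq="D"),
    "forward_4w_volume": [100, 110, 120, 130, 140],
})

# Hypothetical index values for the trading days only (Friday, Monday, Tuesday).
sp500_toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-01-03", "2020-01-06", "2020-01-07"]),
    "sp500_index": [3200.0, 3210.0, 3205.0],
})

# A left merge with indicator=True shows which rating days have no index value.
audit = pd.merge(
    ratings_toy, sp500_toy,
    on="timestamp", how="left", indicator=True, validate="one_to_one",
)
dropped = audit.loc[audit["_merge"] == "left_only", "timestamp"]
print("Days without an index value:", list(dropped.dt.day_name()))

# The default inner merge silently keeps only the matching dates.
merged_toy = pd.merge(ratings_toy, sp500_toy, on="timestamp")
print(merged_toy)
```

The `validate="one_to_one"` argument is an optional safeguard: it raises an error if either table contains duplicate dates, which would otherwise inflate the merged row count.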
# Merge
merged = pd.merge(ratings, sp500_df, on="timestamp")
merged = merged.dropna(subset=["forward_4w_volume", "sp500_index"])
# Compute correlation
corr, _ = pearsonr(merged["forward_4w_volume"], merged["sp500_index"])
print(f"Pearson correlation (4-week volume vs S&P 500): {corr:.2f}")
Pearson correlation (4-week volume vs S&P 500): -0.74
The initial Pearson correlation is strong, suggesting that as the stock market rises, movie rating activity tends to fall — and vice versa.
But is this relationship meaningful? Or is it just an artifact of broader trends over time?
Let’s dig deeper.
Exploration¶
To explore the relationship between movie rating volume and the S&P 500 index, we begin by visualizing the raw data: the 4-week rating volume plotted against the S&P 500 index. Then we investigate how the correlation changes as we vary the size of the rolling time window from 1 week up to 9 weeks. This turns out to be a key step toward understanding the nature of the relationship.
⚠️ Detour: COVID-19 Effects
During the analysis, we also looked at whether the relationship changed during the COVID-19 pandemic. While the correlation does vary sharply before, during, and after the pandemic, this ended up being a side path — not central to the main finding.
If you're curious, that exploration is included in a later appendix-style section.
merged["week_date"] = merged["timestamp"].dt.strftime("%Y-%m-%d")  # Formatted date string for hover
fig = px.scatter(
    merged,
    x="sp500_index",
    y="forward_4w_volume",
    hover_data={"week_date": True, "forward_4w_volume": False, "sp500_index": False},
    title="4-Week Rating Volume vs S&P 500 Index",
    labels={"forward_4w_volume": "4-Week Rating Volume", "sp500_index": "S&P 500 Index"}
)
fig.update_traces(marker=dict(size=8, opacity=0.7))
fig.update_layout(height=500)
fig.show()
A clear negative correlation is visible — when the S&P 500 goes up, rating volume tends to go down.
This prompted a natural follow-up question: does the correlation get stronger if we look over different time periods? We started with a forward-looking 4-week rating volume because we thought the effects of stock market changes might be most visible over the following month, but we had no evidence for that choice. The effect could just as well show up over a shorter or longer time frame. So, instead of comparing the S&P 500 with the 4-week rating volume, what happens if we use a 1-week, 5-week, or even 9-week total volume?
Let’s see how that correlation changes as the time frame increases from 1 to 9 weeks.
# Load n-week forward rating volumes for n=1,2,...,9
ratings_multiweek_df = pd.read_parquet(PROCESSED_DATA_PATH / "daily_forward_multiweek_rating_volume.parquet")
# Merge
merged_multiweek = pd.merge(ratings_multiweek_df, sp500_df, on="timestamp")
# Compute correlations
for n in range(1, 10):
    volume_col = f"forward_{n}w_volume"
    # Drop missing rows for this window only, without shrinking merged_multiweek
    subset = merged_multiweek.dropna(subset=[volume_col, "sp500_index"])
    corr, _ = pearsonr(subset["sp500_index"], subset[volume_col])
    print(f"Pearson correlation for {n}-week range: {corr:.4f}")
Pearson correlation for 1-week range: -0.6577
Pearson correlation for 2-week range: -0.7080
Pearson correlation for 3-week range: -0.7334
Pearson correlation for 4-week range: -0.7497
Pearson correlation for 5-week range: -0.7630
Pearson correlation for 6-week range: -0.7721
Pearson correlation for 7-week range: -0.7800
Pearson correlation for 8-week range: -0.7860
Pearson correlation for 9-week range: -0.7909
A New Hypothesis¶
As seen above, when the rolling window expands, the correlation between rating volume and the S&P 500 index grows steadily stronger — from -0.66 for a 1-week window to -0.79 over a 9-week window.
This raises a new possibility:
Maybe the increasing correlation isn't because the S&P 500 is affecting movie rating volume, but because both variables are independently trending over time.
In other words, the relationship may be spurious — an artifact of their shared dependence on time.
To test this idea, we'll next examine how each variable correlates with time, and then remove that trend to see what remains.
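The intuition behind this hypothesis can be checked on synthetic data. The sketch below builds two series that depend on time but not on each other, then correlates one with rolling averages of the other over longer and longer windows. Everything here is an assumption for illustration: the trends, the noise levels, the window sizes, and the use of a backward-looking rolling mean instead of the forward-looking totals used in this notebook. It shows one plausible mechanism (averaging removes noise and leaves the trend), not a proof that this is what happens in the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days = 2000
t = np.arange(n_days) / n_days  # time scaled to [0, 1)

# Two series that each depend on time but are otherwise independent.
market = pd.Series(t + rng.normal(0.0, 0.05, n_days))        # upward trend, little noise
daily_volume = pd.Series(-t + rng.normal(0.0, 1.5, n_days))  # downward trend, heavy noise

def corr_for_window(window_days: int) -> float:
    """Correlate the market series with a rolling mean of the volume series."""
    smoothed = daily_volume.rolling(window_days).mean()
    # Series.corr ignores the NaN rows at the start of the rolling window.
    return market.corr(smoothed)

correlations = {w: corr_for_window(w) for w in (1, 7, 28, 63)}
for window, r in correlations.items():
    print(f"{window:>2}-day window: r = {r:.2f}")
```

Because the two noise terms are drawn independently, any correlation in this toy setup comes from the shared dependence on time, and it becomes more pronounced as the window grows and the noise averages out.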
Investigation: Correlation with Time and Detrending¶
We first compute the Pearson correlation between time and each of the two variables:
- S&P 500 index
- Movie rating volume
Then, we fit a simple linear trend for each variable and subtract it (detrending). Finally, we compute the correlation between the detrended series.
# Use days since start as a numeric time feature
merged["days_since_start"] = (merged["timestamp"] - merged["timestamp"].min()).dt.days
X_time = merged["days_since_start"].values.reshape(-1, 1)
# Correlation with time
sp_corr_with_time, _ = pearsonr(merged["sp500_index"], merged["days_since_start"])
vol_corr_with_time, _ = pearsonr(merged["forward_4w_volume"], merged["days_since_start"])
print(f"S&P 500 vs time: {sp_corr_with_time:.2f}")
print(f"Rating volume vs time: {vol_corr_with_time:.2f}")
# Detrend both series
def detrend(y, X):
    model = LinearRegression().fit(X, y)
    trend = model.predict(X)
    return y - trend
merged["sp500_detrended"] = detrend(merged["sp500_index"], X_time)
merged["volume_detrended"] = detrend(merged["forward_4w_volume"], X_time)
# Correlation after detrending
detrended_corr, _ = pearsonr(merged["sp500_detrended"], merged["volume_detrended"])
print(f"Correlation after detrending: {detrended_corr:.2f}")
S&P 500 vs time: 0.93
Rating volume vs time: -0.74
Correlation after detrending: -0.23
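Correlating the residuals of two linear fits on time is the same quantity as the first-order partial correlation of the two variables controlling for time, which can be computed from the three pairwise correlations alone. The sketch below checks that equivalence on synthetic data; the helper names `linear_detrend` and `pearson`, the slopes, and the noise levels are assumptions for illustration, and `np.polyfit` stands in for the `LinearRegression` fit used above.

```python
import numpy as np

def linear_detrend(y: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Subtract the least-squares line fitted to y as a function of t."""
    slope, intercept = np.polyfit(t, y, deg=1)
    return y - (slope * t + intercept)

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical data: two series with opposite trends and independent noise.
rng = np.random.default_rng(1)
t = np.arange(500, dtype=float)
x = 2.0 * t + rng.normal(0.0, 50.0, t.size)
y = -1.0 * t + rng.normal(0.0, 80.0, t.size)

r_xy, r_xt, r_yt = pearson(x, y), pearson(x, t), pearson(y, t)

# Partial correlation of x and y controlling for t:
# (r_xy - r_xt * r_yt) / sqrt((1 - r_xt^2) * (1 - r_yt^2))
partial = (r_xy - r_xt * r_yt) / np.sqrt((1 - r_xt**2) * (1 - r_yt**2))

# Correlation of the residuals after removing a linear trend from each series.
detrended = pearson(linear_detrend(x, t), linear_detrend(y, t))

print(f"raw correlation:       {r_xy:.3f}")
print(f"partial correlation:   {partial:.3f}")
print(f"detrended correlation: {detrended:.3f}")
```

This also makes clear what linear detrending does and does not control for: it removes only the straight-line dependence on time, so any curved or seasonal shared pattern would still be left in the residuals.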
Conclusion¶
Although rating volume and the S&P 500 index show a moderately strong negative correlation at first glance, this relationship weakens substantially once we remove their shared trend over time.
After detrending both variables, the correlation drops from -0.74 (for the 4-week window) to just -0.23.
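The apparent relationship was driven largely by the opposing time trends of the two series: over this period the S&P 500 rose (correlation with time 0.93) while rating volume fell (correlation with time -0.74). Only a weak correlation remains after detrending.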
This highlights an important principle in data science:
Always check whether correlations may be driven by external or confounding variables, such as time.
In this case, a visually compelling relationship turned out to be a classic example of a spurious correlation — a false signal caused by coincident trends.
Appendix: A COVID-Era Detour¶
During this project, we also explored whether the relationship between rating volume and the S&P 500 changed during the COVID-19 pandemic.
This turned out to be a detour — interesting, but not central to the main story. Still, the sharp differences across eras are worth noting.
We divided the data into three periods:
- Pre-COVID: before March 2020
- COVID-era: March 2020 to June 2021
- Post-COVID: after June 2021
Below are the correlations within each period.
# Define date boundaries
covid_start = pd.Timestamp("2020-03-01")
covid_end = pd.Timestamp("2021-06-30")
# Pre-COVID
pre_covid = merged[merged["timestamp"] < covid_start].dropna(subset=["forward_4w_volume", "sp500_index"])
pre_corr, _ = pearsonr(pre_covid["sp500_index"], pre_covid["forward_4w_volume"])
# COVID-era
covid_period = merged[
    (merged["timestamp"] >= covid_start) &
    (merged["timestamp"] <= covid_end)
].dropna(subset=["forward_4w_volume", "sp500_index"])
covid_corr, _ = pearsonr(covid_period["sp500_index"], covid_period["forward_4w_volume"])
# Post-COVID
post_covid = merged[merged["timestamp"] > covid_end].dropna(subset=["forward_4w_volume", "sp500_index"])
post_corr, _ = pearsonr(post_covid["sp500_index"], post_covid["forward_4w_volume"])
# Display results
print(f"Pre-COVID Pearson correlation: {pre_corr:.2f}")
print(f"COVID-era Pearson correlation: {covid_corr:.2f}")
print(f"Post-COVID Pearson correlation: {post_corr:.2f}")
Pre-COVID Pearson correlation: -0.75
COVID-era Pearson correlation: -0.84
Post-COVID Pearson correlation: 0.50
These sharp shifts suggest that the relationship between user rating activity and the stock market was not consistent over time. One possible reading is that during the COVID lockdown period people watched and rated more movies while the S&P 500 was volatile, which would fit the stronger negative correlation (-0.84). After June 2021, the correlation did not merely weaken: it reversed sign, to 0.50.
However, rather than pointing to a robust economic signal, these findings highlight how external events and social context can produce spurious patterns. This reinforces our broader conclusion: the overall negative correlation is not stable and likely not meaningful.
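The per-period calculation can also be written as a single labelling step followed by a group-by, which keeps the period boundaries in one place. This is a sketch under assumptions: the helper names `label_era` and `correlation_by_era` and the toy data are invented for illustration, and the boundaries mirror the `covid_start` and `covid_end` dates used in this appendix.

```python
import numpy as np
import pandas as pd

covid_start = pd.Timestamp("2020-03-01")
covid_end = pd.Timestamp("2021-06-30")

def label_era(timestamps: pd.Series) -> np.ndarray:
    """Label each timestamp; np.select uses the first condition that matches."""
    conditions = [
        (timestamps < covid_start).to_numpy(),
        (timestamps <= covid_end).to_numpy(),
    ]
    return np.select(conditions, ["Pre-COVID", "COVID-era"], default="Post-COVID")

def correlation_by_era(df: pd.DataFrame, x_col: str, y_col: str) -> dict:
    """Pearson correlation of x_col and y_col within each era."""
    labelled = df.assign(era=label_era(df["timestamp"]))
    return {
        era: group[x_col].corr(group[y_col])
        for era, group in labelled.groupby("era", sort=False)
    }

# Hypothetical toy data: the two columns move in opposite directions until
# the end of the COVID era and in the same direction afterwards.
toy = pd.DataFrame({"timestamp": pd.date_range("2019-01-01", "2022-12-31", freq="D")})
toy["sp500_index"] = np.arange(len(toy), dtype=float)
is_post = label_era(toy["timestamp"]) == "Post-COVID"
toy["forward_4w_volume"] = np.where(is_post, toy["sp500_index"], -toy["sp500_index"])

result = correlation_by_era(toy, "sp500_index", "forward_4w_volume")
for era, r in result.items():
    print(f"{era}: {r:.2f}")
```

Keeping the boundary logic in one function means the correlations and any era-coloured plot are guaranteed to assign the boundary dates to the same period.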
To visualize these shifts, we modify the scatter plot above, coloring each data point by COVID era. You can toggle each era on or off to explore the patterns.
# Format timestamp for hover
merged['date'] = merged['timestamp'].dt.strftime('%Y-%m-%d')
# Label each row by era
merged["era"] = pd.cut(
    merged["timestamp"],
    bins=[
        pd.Timestamp("2000-01-01"),  # early start
        covid_start,
        covid_end,
        merged["timestamp"].max() + pd.Timedelta(days=1)
    ],
    labels=["Pre-COVID", "COVID-era", "Post-COVID"]
)
# Drop missing values
filtered = merged.dropna(subset=["forward_4w_volume", "sp500_index", "era"])
# Create plot
fig = px.scatter(
    filtered,
    y="forward_4w_volume",
    x="sp500_index",
    color="era",
    hover_data={"date": True, "forward_4w_volume": False, "sp500_index": False, "era": False},
    title="4-Week Rating Volume vs S&P 500 Index by Era",
    labels={
        "forward_4w_volume": "4-Week Rating Volume",
        "sp500_index": "S&P 500 Index",  # key must match the column name to relabel the axis
        "era": "Era"
    }
)
fig.update_traces(marker=dict(size=8, opacity=0.7))
fig.update_layout(height=500, legend_title="Click to filter eras")
fig.show()